Part 1

Datasets and pre-processing steps

In this tutorial, we are mainly going to use the Seurat package with publicly available datasets. Extensive tutorials covering various contexts can be found at https://satijalab.org/seurat/.

Here, in the first part, we are going to analyze a single-cell RNA-seq dataset produced by 10X Genomics and processed through the Cell Ranger(TM) pipeline to generate barcode count matrices.

Please download the “4k Peripheral blood mononuclear cells (PBMCs) from a Healthy Donor” data from ifx:/data/pub/bionano2018/scRNAseqWS.zip and unzip it.

Data specific information:

The data originate from http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc4k/pbmc4k_filtered_gene_bc_matrices.tar.gz: 4k PBMCs from a Healthy Donor*, a Single Cell Gene Expression dataset processed with Cell Ranger 2.1.0 against GRCh38.

- 4,340 cells detected
- Sequenced on an Illumina HiSeq 4000 with approximately 87,000 reads per cell
- 26bp read1 (16bp Chromium barcode and 10bp UMI), 98bp read2 (transcript), and 8bp I7 sample barcode
- Analysis run with --expect-cells=5000
- Published on November 8, 2017

*This dataset is licensed under the Creative Commons Attribution license.

Loading data and initial quality checks

The Seurat package provides a function for reading 10X datasets from a directory. This directory contains a matrix file (matrix.mtx), which stores UMI counts of genes for every cell in sparse matrix format; a barcodes file (barcodes.tsv), which keeps the actual barcode sequences assigned to each cell; and a genes file (genes.tsv) with gene IDs/symbols from the transcriptome annotation.
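As an illustration of what Read10X assembles, here is a minimal base-R sketch that writes a toy three-gene, two-cell dataset (all values hypothetical) to a temporary directory and then reassembles the three files by hand:

```r
library(Matrix)

dir <- tempdir()
# A toy 3-gene x 2-cell sparse count matrix (column-wise fill)
m <- Matrix(c(5, 0, 0, 2, 1, 0), nrow = 3, sparse = TRUE)
writeMM(m, file.path(dir, "matrix.mtx"))
write.table(data.frame(id = c("ENSG01", "ENSG02", "ENSG03"),
                       symbol = c("GENE1", "GENE2", "GENE3")),
            file.path(dir, "genes.tsv"),
            sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
writeLines(c("AAACCTGAGAAGGCCT-1", "AAACCTGAGACAGACC-1"),
           file.path(dir, "barcodes.tsv"))

# Reassemble the pieces, as Read10X does internally
counts <- readMM(file.path(dir, "matrix.mtx"))
genes <- read.delim(file.path(dir, "genes.tsv"), header = FALSE)
barcodes <- readLines(file.path(dir, "barcodes.tsv"))
rownames(counts) <- as.character(genes$V2)  # gene symbols
colnames(counts) <- barcodes                # cell barcodes
```

The sparse (MatrixMarket) format stores only the non-zero entries, which is essential since a typical scRNA-seq count matrix is mostly zeros.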

library(Matrix)
library(Seurat)
library(dplyr)
library(ggplot2)
library(SIMLR)

The Read10X function reads the input files and returns a sparse matrix with all three pieces of information merged together.

pbmc.data <- Read10X(data.dir = "~/Downloads/scRNAseqWS/10Xdata/filtered_gene_bc_matrices/GRCh38/")

We then create a Seurat object using the data matrix we just generated. The raw data are stored in the ‘raw.data’ slot of the Seurat object (pbmc@raw.data).

Filter cells and/or genes:
- min.cells = 3: keep all genes expressed in >= 3 cells.
- min.genes = 200: keep all cells with at least 200 detected genes.

pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, project = "10X_PBMC")

One important measure is the proportion of mitochondrial gene expression to overall expression, because this gives us an idea of the sampling sensitivity for other transcripts. The code below calculates it for each cell and adds it to the Seurat object metadata.

mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)

percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data)

pbmc <- AddMetaData(object = pbmc, 
                    metadata = percent.mito, 
                    col.name = "percent.mito")

head(pbmc@meta.data)
##                  nGene nUMI orig.ident percent.mito
## AAACCTGAGAAGGCCT   748 1738   10X_PBMC   0.06386651
## AAACCTGAGACAGACC  1052 3240   10X_PBMC   0.05462963
## AAACCTGAGATAGTCA   739 1683   10X_PBMC   0.07367796
## AAACCTGAGCGCCTCA   875 2319   10X_PBMC   0.03839517
## AAACCTGAGGCATGGT   951 2983   10X_PBMC   0.02246061
## AAACCTGCAAGGTTCT  1248 4181   10X_PBMC   0.02224348

We can then check the mitochondrial gene expression rate per cell, as well as two other internally calculated metrics: nUMI (total UMI counts per cell) and nGene (number of genes detected).

VlnPlot(pbmc, features.plot = c("nGene", "nUMI", "percent.mito"))

Using ggplot2 (or Seurat’s GenePlot function), it is possible to compare cells by their nUMI, nGene, and percent.mito values.

Since there is a rare subset of cells with an outlier-high mitochondrial percentage and low UMI content, these metrics can be used for filtering as well.

ggplot(pbmc@meta.data, aes(nUMI, nGene)) + geom_point()

ggplot(pbmc@meta.data, aes(nUMI, percent.mito)) + geom_point()

pbmc <- FilterCells(pbmc, 
                    subset.names = c("nGene", "percent.mito"), 
                    low.thresholds = c(200, -Inf), 
                    high.thresholds = c(2500, 0.05))

#Check that the filtering worked as expected.
VlnPlot(pbmc, features.plot = c("nGene", "nUMI", "percent.mito"))

Normalization, Finding Variable Genes, and Scaling the data

The default normalization method is “LogNormalize”, which normalizes each gene’s expression by the cell’s total expression, multiplies by a scale factor (10,000), and then log-transforms the value. The normalized values are stored in the ‘data’ slot of the Seurat object (pbmc@data).
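The arithmetic behind LogNormalize can be sketched in a few lines of base R on a toy genes-by-cells count matrix (values made up):

```r
# Toy counts: 3 genes x 2 cells (hypothetical values)
counts <- matrix(c(10, 0,
                    2, 8,
                    0, 4), nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))

# Divide each cell by its total counts, multiply by the scale factor,
# then take the natural log of (1 + x)
norm <- log1p(t(t(counts) / colSums(counts)) * 1e4)
```

Dividing by per-cell totals removes differences in sequencing depth between cells; the log transform stabilizes the variance of highly expressed genes.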

pbmc <- NormalizeData(pbmc, 
                      normalization.method = "LogNormalize", 
                      scale.factor = 10000)

Here, the ‘FindVariableGenes’ function calculates the variance and mean for every gene across all cells and sorts genes by their variance-to-mean ratio (VMR). We are going to select the top 1,000 genes as highly variable genes.
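The core dispersion calculation can be sketched on simulated data as follows (FindVariableGenes additionally bins genes by mean expression and z-scores the dispersions within bins, which is omitted here):

```r
set.seed(42)
# Toy expression matrix: 100 genes x 20 cells (simulated counts)
expr <- matrix(rpois(2000, lambda = 5), nrow = 100)

# Per-gene mean and variance across cells
gene.mean <- rowMeans(expr)
gene.var  <- apply(expr, 1, var)

# Variance-to-mean ratio; rank genes and keep the top 10 as "highly variable"
vmr <- gene.var / gene.mean
hv <- head(order(vmr, decreasing = TRUE), 10)
```

For Poisson (pure technical) noise the VMR is about 1, so genes with a large VMR show variability beyond what sampling alone would produce.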

pbmc <- FindVariableGenes(pbmc, 
                          mean.function = ExpMean, 
                          dispersion.function = LogVMR, 
                          do.plot = FALSE,
                          display.progress = FALSE)

hv.genes <- head(rownames(pbmc@hvg.info), 1000)

The ScaleData function is used to scale and center the expression values of each gene. It also gives us the opportunity to regress out unwanted variation from known sources (via linear regression). Keep in mind that since downstream analyses such as dimension reduction use only the highly variable genes, we can scale the data using only ‘hv.genes’. The scaled expression values are then stored in the ‘scale.data’ slot of the object (pbmc@scale.data).
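The two operations can be sketched in base R on toy data: regress a hypothetical nuisance covariate out of each gene via linear-model residuals, then z-score each gene:

```r
set.seed(7)
# Toy normalized expression: 5 genes x 30 cells, plus a made-up
# per-cell nuisance covariate standing in for sequencing depth
expr <- matrix(rnorm(150, mean = 2), nrow = 5)
nUMI <- colSums(abs(expr)) * 100

# Regress the covariate out of each gene (keep the lm residuals) ...
regressed <- t(apply(expr, 1, function(g) residuals(lm(g ~ nUMI))))

# ... then center and scale each gene to mean 0, sd 1
scaled <- t(scale(t(regressed)))
```

After this step every gene contributes on the same scale, so PCA is not dominated by a handful of highly expressed genes.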

pbmc <- ScaleData(pbmc, 
                  genes.use = hv.genes, 
                  vars.to.regress = c("nUMI", "percent.mito"),
                  display.progress = FALSE)

Dimension Reduction and Finding cell subtypes (with tSNE)

Here, we are performing a Principal Component Analysis (PCA) on the normalized and scaled expression data using highly variable genes.
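On a toy scaled matrix, the same step can be sketched with base R’s prcomp (note that prcomp expects observations, here cells, in rows, hence the transpose):

```r
set.seed(3)
# Toy scaled data: 50 genes x 40 cells
scaled <- matrix(rnorm(2000), nrow = 50)

# PCA over cells; transpose so cells are the observations
pca <- prcomp(t(scaled))

embeddings <- pca$x    # cell coordinates in PC space
sdev <- pca$sdev       # standard deviation explained per PC, decreasing
```

`sdev` is what an elbow plot displays: the point where it flattens marks PCs that mostly capture noise.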

pbmc <- RunPCA(pbmc, 
               pc.genes = hv.genes, 
               do.print = FALSE)

The first plot shows the first and second principal components.

PCAPlot(pbmc, dim.1 = 1, dim.2 = 2)

This second plot shows the standard deviation explained by each PC. We are going to include PCs up to the point where the graph shows an elbow (kink).

PCElbowPlot(pbmc)

Cell clustering is done using the Shared Nearest Neighbor (SNN) method: first find the k-nearest neighbors of every cell, then construct an SNN graph. For more information about the algorithm, see Waltman and van Eck (2013), The European Physical Journal B. Each cell (labelled with its barcode sequence) is assigned to a cluster ID:
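A minimal sketch of the SNN graph construction on toy PC embeddings (Seurat then runs modularity optimization on this graph to call the clusters, which is omitted here):

```r
set.seed(5)
# Toy PC embeddings: 12 cells x 3 PCs
emb <- matrix(rnorm(36), nrow = 12)
k <- 4

# k nearest neighbors of each cell (excluding the cell itself)
d <- as.matrix(dist(emb))
knn <- t(apply(d, 1, function(x) order(x)[2:(k + 1)]))

# SNN edge weight: Jaccard overlap of the two cells' neighbor sets
snn <- matrix(0, 12, 12)
for (i in 1:12) for (j in 1:12) {
  shared <- length(intersect(knn[i, ], knn[j, ]))
  snn[i, j] <- shared / (2 * k - shared)
}
```

Cells in the same dense region share many neighbors and get strong edges; edges between regions are weak, which is what makes the subsequent graph clustering work.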

pbmc <- FindClusters(pbmc, 
                     reduction.type = "pca", 
                     dims.use = 1:10, 
                     resolution = 0.6, 
                     print.output = TRUE, 
                     save.SNN = TRUE)

head(pbmc@ident,20)
## AAACCTGAGCGCCTCA AAACCTGAGGCATGGT AAACCTGCAAGGTTCT AAACCTGCATCCCATC 
##                4                0                4                1 
## AAACCTGCATGAAGTA AAACCTGGTACATCCA AAACCTGGTGCGGTAA AAACCTGTCGTGGTCG 
##                2                1                0                0 
## AAACCTGTCTCTGCTG AAACGGGAGGCTAGCA AAACGGGAGTGTCCAT AAACGGGGTCTTCTCG 
##                4                4                5                2 
## AAACGGGGTGGACGAT AAACGGGGTTTGCATG AAACGGGTCTGGGCCA AAACGGGTCTGGTATG 
##                3                2                5                4 
## AAAGATGAGACATAAC AAAGATGGTCCCTACT AAAGATGTCCGAATGT AAAGATGTCTGGTTCC 
##                4                1                5                2 
## Levels: 0 1 2 3 4 5 6 7 8 9

For visualizing the clustered cells, we are going to use a tSNE plot. When running tSNE, we should use the same PCs as before so that the embedding reflects the same structure used for clustering.

pbmc <- RunTSNE(pbmc, dims.use = 1:10, do.fast = TRUE)

TSNEPlot(pbmc, do.label = TRUE)

Detecting marker genes of cell clusters

Finally, we are going to determine the differentially expressed genes that mark each cluster. The ‘FindAllMarkers’ function takes the expression of each gene in one cluster and compares it against all other clusters. By default, the statistical test used is the Wilcoxon rank-sum test; however, there are multiple other options, including DESeq2. Here, we apply further constraints, such as ‘min.pct = 0.25’, meaning that only genes expressed in at least 25% of the cells in the cluster are tested.
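The per-gene test can be sketched in base R: a Wilcoxon rank-sum test of one gene’s (simulated) expression in a cluster versus all other cells, along with the average log fold change and the fraction of expressing cells:

```r
set.seed(9)
# Simulated log-normalized expression of one gene in 60 cells;
# cluster 1 expresses it more highly (values are made up)
cluster <- rep(c(1, 2), each = 30)
expr <- c(rnorm(30, mean = 2), rnorm(30, mean = 0.5))

in.clust  <- expr[cluster == 1]
out.clust <- expr[cluster == 2]

# Rank-sum test: is expression in the cluster shifted vs. the rest?
p <- wilcox.test(in.clust, out.clust)$p.value
# On log-scale data a difference of means is an average log fold change
avg.logFC <- mean(in.clust) - mean(out.clust)
# Fraction of cluster cells expressing the gene (the min.pct quantity)
pct.1 <- mean(in.clust > 0)
```

FindAllMarkers repeats this for every gene and every cluster, then corrects the p-values for multiple testing.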

pbmc.markers <- FindAllMarkers(pbmc, 
                               only.pos = TRUE, 
                               min.pct = 0.25, 
                               logfc.threshold = 0.25)

top5.markers <- pbmc.markers %>% group_by(cluster) %>% top_n(5, avg_logFC)

best.markers <- pbmc.markers %>% group_by(cluster) %>% top_n(1, avg_logFC)

FeaturePlot(object = pbmc, 
            features.plot = best.markers$gene, 
            cols.use = c("grey", "blue"), 
            reduction.use = "tsne")

Part 2

Working with multiple scRNAseq datasets

Alignment workflow for the four mouse brain datasets

- Zeisel et al.: Single-cell RNA-seq of mouse cerebral cortex
- Tasic et al.: Adult mouse cortical cell taxonomy by single-cell transcriptomics
- Romanov et al.: Single-cell RNA-seq of mouse hypothalamus
- Marques et al.: RNA-seq analysis of single cells of the oligodendrocyte lineage from nine distinct regions of the anterior-posterior and dorsal-ventral axis of the mouse juvenile central nervous system

The raw sequencing reads were downloaded from the public SRA database and processed using the Kallisto quantification pipeline.

Loading datasets and QC (with PCA)

We start by loading the expression data and the metadata of each study.

ob.list <- list("zeisel", "romanov", "tasic", "marques")

#Load the expression and meta data of each study.

for (i in 1:length(ob.list)){
  obj.data <- paste(ob.list[[i]], ".data", sep = "")
  #Read the expression matrix from a text file for each dataset.
  assign(obj.data, read.delim(paste("~/Downloads/scRNAseqWS/", ob.list[[i]], ".expression.txt", sep = ""), header = TRUE, row.names = 1))
}

#Since the expression matrices of these datasets are in TPM (already normalized), we are going to skip the NormalizeData step below. However, we still need to log-transform them: log1p(x) = log(1 + x), natural log.
zeisel.data <- log1p(zeisel.data)
romanov.data <- log1p(romanov.data)
tasic.data <- log1p(tasic.data)
marques.data <- log1p(marques.data)

for (i in 1:length(ob.list)){
  obj.meta <- paste(ob.list[[i]], ".meta", sep = "")
  #Read the run-information metadata from a text file for each dataset.
  assign(obj.meta, read.delim(paste("~/Downloads/scRNAseqWS/", ob.list[[i]], ".RunTable.txt", sep = ""), header = TRUE))
}

Check the PCA and tSNE prior to alignment:

rownames(zeisel.meta) <- zeisel.meta$Run_s
rownames(romanov.meta) <- romanov.meta$Run_s
rownames(tasic.meta) <- tasic.meta$Run_s
rownames(marques.meta) <- marques.meta$Run_s

batches <- rbind(zeisel.meta[,c("Run_s","Owner","SRA_Study_s")],
                 tasic.meta[,c("Run_s","Owner","SRA_Study_s")],
                 romanov.meta[,c("Run_s","Owner","SRA_Study_s")],
                 marques.meta[,c("Run_s","Owner","SRA_Study_s")])

combined.data <- cbind(zeisel.data, tasic.data, romanov.data, marques.data)

combined.data <- as(as.matrix(combined.data), "dgCMatrix")

fourDataset <- CreateSeuratObject(raw.data = combined.data, 
                                  project = "4dataset.Pre")

fourDataset <- AddMetaData(fourDataset, metadata = batches)

fourDataset <- FilterCells(fourDataset, 
                           subset.names = "nGene", 
                           low.thresholds = 2500)

fourDataset <- FindVariableGenes(fourDataset, 
                                 do.plot = F, 
                                 display.progress = F)

fourDataset <- ScaleData(fourDataset, display.progress = F)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
fourDataset <- RunPCA(fourDataset, 
                      pc.genes = fourDataset@var.genes, 
                      pcs.compute = 5, 
                      do.print = FALSE)

PCAPlot(fourDataset, pt.size=1, group.by ="Owner", dim.1 = 1, dim.2 = 2)

fourDataset <- RunTSNE(fourDataset, 
                       reduction.use = "pca", 
                       dims.use = 1:5)

TSNEPlot(fourDataset, do.label = T, group.by ="Owner")

#Subset the data
zeisel <- SubsetData(fourDataset, 
                     cells.use=names(zeisel.data), do.center=T, do.scale=T)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## Scaling data matrix
tasic <- SubsetData(fourDataset, 
                    cells.use=names(tasic.data), do.center=T, do.scale=T)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## Scaling data matrix
romanov <- SubsetData(fourDataset, 
                      cells.use=names(romanov.data), do.center=T, do.scale=T)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## Scaling data matrix
marques <- SubsetData(fourDataset, 
                      cells.use=names(marques.data), do.center=T, do.scale=T)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## Scaling data matrix

Outline of the alignment process
  1. Select highly variable genes shared by at least two datasets,
  2. Identify shared correlation structures (canonical correlation vectors) across the datasets,
  3. Align these dimensions using dynamic time warping,
  4. Use the cells embedded in the low-dimensional space for clustering.
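Step 2 can be sketched for the two-dataset case with base R’s cancor: treating the shared genes as observations, CCA finds one linear combination of cells per dataset such that the two combinations are maximally correlated (RunMultiCCA generalizes this to more than two datasets; the data below are simulated):

```r
set.seed(21)
# Two toy datasets sharing 50 genes, with 15 and 20 cells respectively,
# built around 5 shared latent factors plus noise
shared <- matrix(rnorm(50 * 5), nrow = 50)
X <- shared %*% matrix(rnorm(5 * 15), nrow = 5) +
     matrix(rnorm(50 * 15, sd = 0.5), nrow = 50)
Y <- shared %*% matrix(rnorm(5 * 20), nrow = 5) +
     matrix(rnorm(50 * 20, sd = 0.5), nrow = 50)

# Canonical correlation analysis across the two gene-by-cell matrices
cc <- cancor(X, Y)
cc$cor  # canonical correlations, strongest first
```

The leading canonical correlations are close to 1 because the two datasets share latent structure; the subsequent dynamic time warping step (AlignSubspace) then stretches these shared dimensions so they are directly comparable across datasets.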

Finding common variable genes between multiple datasets

Determine the genes to use for CCA; they must be highly variable in at least two datasets.

ob.list <- list(zeisel, romanov, tasic, marques)

genes.use <- c()
for (i in 1:length(ob.list)) {
  genes.use <- c(genes.use, head(rownames(ob.list[[i]]@hvg.info), 1000))
}
genes.use <- names(which(table(genes.use) > 1))
for (i in 1:length(ob.list)) {
  genes.use <- genes.use[genes.use %in% rownames(ob.list[[i]]@scale.data)]
}
Multi-dataset alignment with CCA

Run multi-set CCA

mouseBrain.integrated <- RunMultiCCA(ob.list, genes.use = genes.use, num.ccs = 15)
## [1] "Computing CC 1"
## [1] "Computing CC 2"
## [1] "Computing CC 3"
## [1] "Computing CC 4"
## [1] "Computing CC 5"
## [1] "Computing CC 6"
## [1] "Computing CC 7"
## [1] "Computing CC 8"
## [1] "Computing CC 9"
## [1] "Computing CC 10"
## [1] "Computing CC 11"
## [1] "Computing CC 12"
## [1] "Computing CC 13"
## [1] "Computing CC 14"
## [1] "Computing CC 15"
## Scaling data matrix

Run the rare non-overlapping filtering step: CalcVarExpRatio computes, per cell, the ratio of variance explained by CCA to that explained by PCA; cells with a low ratio (here < 0.5) are likely dataset-specific and are discarded.

mouseBrain.integrated <- CalcVarExpRatio(mouseBrain.integrated,
                                         reduction.type = "pca",
                                         grouping.var = "Owner", 
                                         dims.use = 1:10)

mouseBrain.integrated <- SubsetData(mouseBrain.integrated, 
                                    subset.name = "var.ratio.pca",
                                    accept.low = 0.5)

Alignment:

mouseBrain.integrated <- AlignSubspace(mouseBrain.integrated,
                                       reduction.type = "cca",
                                       grouping.var = "Owner",
                                       dims.align = 1:10)

Post-Alignment cell clustering

t-SNE and Clustering

mouseBrain.integrated <- FindClusters(mouseBrain.integrated,
                                      reduction.type = "cca.aligned",
                                      dims.use = 1:10, save.SNN = T,
                                      resolution = 0.4)

mouseBrain.integrated <- RunTSNE(mouseBrain.integrated,
                                 reduction.use = "cca.aligned",
                                 dims.use = 1:10,
                                 check_duplicates = FALSE)
# Visualization
TSNEPlot(mouseBrain.integrated, do.label = T)

TSNEPlot(mouseBrain.integrated, do.label = T, group.by ="Owner")

Exploring alternative clustering options with SIMLR

library(SIMLR)

set.seed(11111)
#zeisel.data is already log-transformed.
# Estimate the optimal number of clusters as described in the SIMLR
# Nat. Methods paper: pick a cluster range, and it reports two metrics;
# the lower the value, the more support for that number of clusters.
# In our limited experience, the two metrics are concordant.

zclust<-SIMLR_Estimate_Number_of_Clusters(zeisel.data, NUMC=2:5)

#Run SIMLR with k = 4 clusters
zsimlr<-SIMLR(zeisel.data, 4)
## Computing the multiple Kernels.
## Performing network diffiusion.
## Iteration:  1 
## Iteration:  2 
## Iteration:  3 
## Iteration:  4 
## Iteration:  5 
## Iteration:  6 
## Iteration:  7 
## Iteration:  8 
## Iteration:  9 
## Iteration:  10 
## Iteration:  11 
## Iteration:  12 
## Iteration:  13 
## Performing t-SNE.
## Epoch: Iteration # 100  error is:  0.1880229 
## Epoch: Iteration # 200  error is:  0.1628646 
## Epoch: Iteration # 300  error is:  0.1566613 
## Epoch: Iteration # 400  error is:  0.1535306 
## Epoch: Iteration # 500  error is:  0.1516693 
## Epoch: Iteration # 600  error is:  0.1503478 
## Epoch: Iteration # 700  error is:  0.1492706 
## Epoch: Iteration # 800  error is:  0.1484344 
## Epoch: Iteration # 900  error is:  0.1477013 
## Epoch: Iteration # 1000  error is:  0.1470672 
## Performing Kmeans.
## Performing t-SNE.
## Epoch: Iteration # 100  error is:  12.09497 
## Epoch: Iteration # 200  error is:  0.2800904 
## Epoch: Iteration # 300  error is:  0.2665204 
## Epoch: Iteration # 400  error is:  0.262567 
## Epoch: Iteration # 500  error is:  0.2604442 
## Epoch: Iteration # 600  error is:  0.2590736 
## Epoch: Iteration # 700  error is:  0.258107 
## Epoch: Iteration # 800  error is:  0.2573748 
## Epoch: Iteration # 900  error is:  0.256798 
## Epoch: Iteration # 1000  error is:  0.2563259
# Create plotting function, color-coding points by cluster membership
plotSIMLRclusters <- function(obj) {
  col <- ifelse(obj$y$cluster == 1, 'red',
         ifelse(obj$y$cluster == 2, 'blue',
         ifelse(obj$y$cluster == 3, 'green', 'yellow')))
  plot(obj$ydata, col = col, xlab = "SIMLR component 1",
       ylab = "SIMLR component 2", pch = 20, cex = 0.7)
}

# Call plotting function
plotSIMLRclusters(zsimlr)

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin17.5.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2 SIMLR_1.6.0    dplyr_0.7.6    Seurat_2.3.3  
## [5] cowplot_0.9.3  ggplot2_3.0.0  Matrix_1.2-14 
## 
## loaded via a namespace (and not attached):
##   [1] diffusionMap_1.1-0   Rtsne_0.13           colorspace_1.3-2    
##   [4] class_7.3-14         modeltools_0.2-22    ggridges_0.5.0      
##   [7] mclust_5.4.1         rprojroot_1.3-2      htmlTable_1.12      
##  [10] base64enc_0.1-3      rstudioapi_0.7       proxy_0.4-22        
##  [13] flexmix_2.3-14       bit64_0.9-7          RSpectra_0.13-1     
##  [16] mvtnorm_1.0-8        codetools_0.2-15     splines_3.5.0       
##  [19] R.methodsS3_1.7.1    robustbase_0.93-1    knitr_1.20          
##  [22] Formula_1.2-3        jsonlite_1.5         ica_1.0-2           
##  [25] cluster_2.0.7-1      kernlab_0.9-26       png_0.1-7           
##  [28] R.oo_1.22.0          compiler_3.5.0       backports_1.1.2     
##  [31] assertthat_0.2.0     lazyeval_0.2.1       lars_1.2            
##  [34] acepack_1.4.1        htmltools_0.3.6      tools_3.5.0         
##  [37] igraph_1.2.1         gtable_0.2.0         glue_1.2.0          
##  [40] RANN_2.6             reshape2_1.4.3       Rcpp_0.12.17        
##  [43] trimcluster_0.1-2    gdata_2.18.0         ape_5.1             
##  [46] nlme_3.1-137         iterators_1.0.10     fpc_2.1-11          
##  [49] lmtest_0.9-36        stringr_1.3.1        irlba_2.3.2         
##  [52] gtools_3.8.1         DEoptimR_1.0-8       MASS_7.3-50         
##  [55] zoo_1.8-3            scales_0.5.0         doSNOW_1.0.16       
##  [58] parallel_3.5.0       RColorBrewer_1.1-2   yaml_2.1.19         
##  [61] reticulate_1.9       pbapply_1.3-4        gridExtra_2.3       
##  [64] rpart_4.1-13         segmented_0.5-3.0    latticeExtra_0.6-28 
##  [67] stringi_1.2.3        foreach_1.4.4        checkmate_1.8.5     
##  [70] caTools_1.17.1       SDMTools_1.1-221     rlang_0.2.1         
##  [73] pkgconfig_2.0.1      dtw_1.20-1           prabclus_2.2-6      
##  [76] bitops_1.0-6         pracma_2.1.4         evaluate_0.10.1     
##  [79] lattice_0.20-35      ROCR_1.0-7           purrr_0.2.5         
##  [82] bindr_0.1.1          labeling_0.3         htmlwidgets_1.2     
##  [85] bit_1.1-14           tidyselect_0.2.4     RcppAnnoy_0.0.10    
##  [88] plyr_1.8.4           magrittr_1.5         R6_2.2.2            
##  [91] snow_0.4-2           gplots_3.0.1         Hmisc_4.1-1         
##  [94] pillar_1.3.0         foreign_0.8-70       withr_2.1.2         
##  [97] fitdistrplus_1.0-9   mixtools_1.1.0       survival_2.42-6     
## [100] scatterplot3d_0.3-41 nnet_7.3-12          tibble_1.4.2        
## [103] tsne_0.1-3           crayon_1.3.4         hdf5r_1.0.0         
## [106] KernSmooth_2.23-15   rmarkdown_1.10       grid_3.5.0          
## [109] data.table_1.11.4    metap_0.9            digest_0.6.15       
## [112] diptest_0.75-7       tidyr_0.8.1          R.utils_2.6.0       
## [115] stats4_3.5.0         munsell_0.5.0